Semantic Harmonization: From the Mars Climate Orbiter to Healthcare

Thursday, May 27, 2021

Semantic harmonization, the basic concept that multiple parties agree on the meaning of information, is essential for any mission that relies on collaboration. In 1998, NASA launched its Mars Climate Orbiter, a $125 million project designed to explore the Martian atmosphere and climate, and to later serve as a communications relay for the Mars Polar Lander. After almost 10 months of space travel the orbiter arrived at its final destination, but instead of taking its intended place in orbit, it detached from its path, shattered into pieces, and burned. Every detail of this highly anticipated mission was carefully calculated. Yet something went terribly wrong. An investigation later revealed that the two collaborating organizations did not exactly speak the same language but were unaware that a translation was needed: Jet Propulsion Laboratory’s navigation software was relying on the metric system, while Lockheed Martin’s software was doing calculations based on the English system. This failure could have been avoided if the two systems had focused on the meaning of the information rather than the numerical values. Today, many organizations, a large share of them in healthcare, still struggle to make use of each other’s data.

Despite wide implementation of Electronic Health Records (EHR) by healthcare providers since the introduction of the HITECH Act in 2009, digital innovation has lagged behind. When asked about factors delaying innovation in healthcare, stakeholders often cite lack of interoperability between health systems as the main problem. Lack of metadata, varying standards used by different health systems, and the same standards applied differently even within a single system are commonly cited obstacles to interoperability, improved care, and high-quality research. The same obstacles also impact the capture and integration of other health-related data (e.g., environmental or occupational exposures, social determinants of health, or data from wearable devices). As a result, large amounts of valuable data are stored in separate system silos, available only for local use.

To overcome this semantic disarray, Cloud Privacy Labs developed the Layered Schema Architecture (LSA), an open-source privacy-conscious semantic interoperability solution for the capture, processing, and exchange of data.

Schemas are abstract designs that describe the shape of data and the relationships between data elements. Traditional schemas deal with the validity of data in form, not in its meaning. Layered schemas differ from traditional schemas by adding multi-dimensional open-ended annotations that provide semantic information. For example, in a traditional schema, a “Patient” can be defined as an object consisting of a number of data fields, but there is no way to represent that the patient may also be a “Research Participant” in another context. A layered schema can add such contextual information using interchangeable overlays. These overlays deal with variability related to context and differences in data capture conventions, jurisdictions, locale, vendors, or many other factors.
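The “Patient” example above can be sketched in a few lines of Python. This is only an illustration of the overlay idea, not the actual LSA format or API: the dictionary layout and the names PATIENT_SCHEMA, RESEARCH_OVERLAY, apply_overlay, and compose are all assumptions made for this example.

```python
from copy import deepcopy

# Hypothetical base schema: describes only the shape of the data.
PATIENT_SCHEMA = {
    "Patient": {
        "attributes": {
            "name": {"type": "string"},
            "birthDate": {"type": "date"},
        },
        "annotations": {},
    }
}

# Hypothetical overlay: adds contextual meaning without changing the shape.
RESEARCH_OVERLAY = {
    "Patient": {
        "annotations": {"alsoKnownAs": "Research Participant"},
        "attributes": {"name": {"privacy": "identifying"}},
    }
}


def apply_overlay(schema, overlay):
    """Return a new schema with the overlay merged in recursively."""
    merged = deepcopy(schema)
    for key, value in overlay.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = apply_overlay(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def compose(schema, *overlays):
    """Apply overlays in order; later overlays win on conflicting keys."""
    for overlay in overlays:
        schema = apply_overlay(schema, overlay)
    return schema


research_view = compose(PATIENT_SCHEMA, RESEARCH_OVERLAY)
print(research_view["Patient"]["annotations"])
```

Because the base schema is never modified, the same “Patient” definition can be combined with a different set of overlays for a different context.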

Key features of the Layered Schema Architecture

Semantic Harmonization and Interoperability

Semantic harmonization refers to the process of combining disparate sources and representations of data into a common form so that items of data share meaning. Currently, the aggregation, standardization, and reuse of existing electronic health data is a major challenge for data warehouses and for clinical research organizations. Semantic harmonization has to take into account the context in which data are captured as well as the context in which data will be used. One common obstacle is the difficulty of interpreting data based on context. Layered Schema Architecture helps overcome this obstacle by providing tools to enrich data with contextual metadata, by enabling semantic harmonization, and ultimately by facilitating rapid construction of data sets for different use cases (e.g., data for clinical research or AI training).

Standardized Management of Metadata

The Layered Schema Architecture offers a mechanism to manage metadata as separate layers. These layers can include a wide variety of information (data source, context of data capture, etc.) and can utilize emerging standard ontologies for metadata management. Multiple overlays can be used to include format constraints, local language, privacy classifications, data retention policies, or provenance information.

Linked Data

The Layered Schema Architecture describes an abstract data model in the form of linked data, with interchangeable overlays that add metadata to account for contextual information, as well as variations due to jurisdiction, local conventions, Application Programming Interface (API) vendors, or personal preferences. Data coming from heterogeneous sources that use different conventions and representations can be enriched with source-specific metadata and mapped to a common information model.

Integration of EHR and Other Data

Layered schemas can be used to integrate and harmonize non-standard heterogeneous data that impact health outcomes but come from outside traditional healthcare settings. These can be data on social determinants of health, data from wearable devices describing lifestyle patterns, or information on an individual’s environmental or occupational exposures. LSA can be used to map vendor-specific data elements into a common data vocabulary.
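A rough sketch of mapping vendor-specific data elements into a common vocabulary might look like the following. The vendor names, field names, vocabulary terms, and the to_common function are all hypothetical and are not taken from LSA itself.

```python
# Hypothetical per-vendor overlays: vendor field name -> common vocabulary term.
VENDOR_OVERLAYS = {
    "wearable_a": {"hr_bpm": "heartRate", "steps_cnt": "stepCount"},
    "wearable_b": {"pulse": "heartRate", "dailySteps": "stepCount"},
}


def to_common(source, record):
    """Rename vendor fields to common terms, keeping unmapped fields aside."""
    mapping = VENDOR_OVERLAYS[source]
    common, unmapped = {}, {}
    for field, value in record.items():
        if field in mapping:
            common[mapping[field]] = value
        else:
            unmapped[field] = value  # retained so that nothing is silently lost
    return {"source": source, "data": common, "unmapped": unmapped}


a = to_common("wearable_a", {"hr_bpm": 62, "steps_cnt": 8400})
b = to_common("wearable_b", {"pulse": 71, "dailySteps": 5200, "firmware": "2.1"})
print(a["data"], b["data"])
```

Keeping the source name and the unmapped fields alongside the harmonized data is one way to preserve provenance; a real implementation may handle this differently.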

Improved Patient Trust and Engagement

Layered Schema Architecture enables implementation of machine-readable Data Use Agreements (DUA) that incorporate user consent. A schema that includes layers marking certain data elements with privacy classifications can be associated with DUAs and/or user consent to remove or mask certain attributes during data exchange. Such privacy layers can be linked to patient-centered consent management for research, where potential participants can easily authorize information disclosure and consent to participation. Layered schemas can also be used with language-specific layers to automatically generate multilingual data entry interfaces. These interfaces can support research subject recruitment and informed consent capture, thus potentially increasing representation of diverse groups in research.
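As a simple illustration of language-specific layers, the sketch below picks field labels for a data entry form from a per-language layer and falls back to a default language when a translation is missing. The field names, the label text, and the form_labels function are assumptions made for this example, not part of LSA.

```python
# Hypothetical form fields and language layers (labels only).
FIELDS = ["givenName", "birthDate", "consentToContact"]

LANGUAGE_LAYERS = {
    "en": {
        "givenName": "Given name",
        "birthDate": "Date of birth",
        "consentToContact": "May we contact you about this study?",
    },
    "es": {
        "givenName": "Nombre",
        "birthDate": "Fecha de nacimiento",
        # "consentToContact" is intentionally missing to show the fallback.
    },
}


def form_labels(fields, lang, fallback="en"):
    """Return (field, label) pairs, falling back when a label is missing."""
    layer = LANGUAGE_LAYERS.get(lang, {})
    base = LANGUAGE_LAYERS[fallback]
    return [(f, layer.get(f, base.get(f, f))) for f in fields]


for field, label in form_labels(FIELDS, "es"):
    print(f"{field}: {label}")
```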

Applications of Layered Schema Architecture

Layered schemas can be useful in any area that relies on collaboration between partners who capture data from multiple sources with diverse goals. Semantic harmonization might have saved the Mars Climate Orbiter. Two decades have passed since this highly visible example, yet many sectors, including healthcare and public health, still experience discordance of information as a major obstacle to progress and innovation. Semantic harmonization is particularly important for data warehouse operations that pool data from disparate sources, health information exchanges (HIEs) that facilitate data exchange between providers with varying implementations, and consumer applications that aggregate data from multiple sources.

Layered Schema Architecture for data warehouses

EHR data can be highly variable between different sources and even within a single EHR system, due to differences between vendors, data models, the underlying data codification approaches, or institutional conventions. In a traditional data warehousing operation, customized Extraction-Transformation-Loading (ETL) pipelines are the key to semantic harmonization. However, these data transformations often lead to data loss, do not scale well, and require many customizations that are specific to a particular problem, location, or data source.

Layered Schema Architecture is a scalable semantic harmonization solution for data warehousing applications. With LSA, a core set of schema layers is used to describe the common attributes of the captured data, while data-source-specific overlays add semantic information describing particular variations. The captured data elements are translated into linked data format and annotated with semantic tags using a set of overlays tailored for the data source. The annotated linked data representation can then be transformed into a common data model, or it can be used directly, for example to build a training data set for an AI application from the source data and its associated metadata.
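The flow described above (core layer, source-specific overlays, annotated nodes, common data model) can be sketched as follows. This is a minimal illustration under stated assumptions: the clinic names, field names, node layout (loosely modeled on linked-data nodes with an "@id"), and the functions annotate and to_common_model are hypothetical and do not reflect the actual LSA implementation.

```python
LB_TO_KG = 0.45359237  # exact definition of the pound in kilograms

# Hypothetical core layer: common attributes shared by all sources.
CORE_LAYER = {"bodyWeight": {"valueType": "decimal", "commonUnit": "kg"}}

# Hypothetical source-specific overlays: semantic tags for each source field.
SOURCE_OVERLAYS = {
    "clinic_a": {"wt_lb": {"term": "bodyWeight", "unit": "lb"}},
    "clinic_b": {"weightKg": {"term": "bodyWeight", "unit": "kg"}},
}


def annotate(source, record_id, record):
    """Turn a flat record into annotated nodes, one per captured data element."""
    overlay = SOURCE_OVERLAYS[source]
    nodes = []
    for field, value in record.items():
        tags = overlay.get(field, {"term": None})  # unknown fields are kept, untagged
        nodes.append({"@id": f"{source}/{record_id}/{field}",
                      "source": source, "value": value, **tags})
    return nodes


def to_common_model(nodes):
    """Build a common-model row, using the unit tag to convert values to kg."""
    row = {}
    for node in nodes:
        term = node.get("term")
        if term not in CORE_LAYER:
            continue  # not part of the common model; still present in the nodes
        unit = node.get("unit")
        if unit == "kg":
            row[term] = node["value"]
        elif unit == "lb":
            row[term] = node["value"] * LB_TO_KG
        else:
            raise ValueError(f"no unit annotation for {node['@id']}")
    return row


nodes_a = annotate("clinic_a", "r1", {"wt_lb": 150.0, "note": "fasting"})
nodes_b = annotate("clinic_b", "r2", {"weightKg": 70.0})
print(to_common_model(nodes_a), to_common_model(nodes_b))
```

The conversion relies on the unit annotation rather than on the bare number, which is the kind of check that was missing in the Mars Climate Orbiter case.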

Layered Schema Architecture for privacy-conscious data exchange for HIEs or consumer applications

Layered schemas can be used to implement machine-readable data use and exchange agreements. A set of overlays can be used to tag data elements with different privacy levels, and these data elements can later be redacted based on user consent and data exchange agreements. An important use case is the exchange of sensitive health data (e.g., mental health or substance use data), where privacy attributes control who can access what data, when, and for what purpose. In a health information exchange setting, layered schemas can be used both to tag sensitive information and to redact tagged information based on patient consent. Similarly, in a patient-centric health data wallet, the wallet application can use different sets of overlays to encode different data exchange scenarios based on patient input.
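Tagging and consent-based redaction could be sketched as below. The privacy classes, field names, and the redact function are hypothetical, and treating unclassified fields as not shareable is a design assumption of this example rather than documented LSA behavior.

```python
# Hypothetical privacy overlay: data element -> privacy classification.
PRIVACY_OVERLAY = {
    "name": "identifying",
    "diagnosis": "clinical",
    "substanceUse": "sensitive",
}


def redact(record, privacy_overlay, permitted_classes, mask="[REDACTED]"):
    """Mask every field whose privacy class is not covered by the consent."""
    redacted = {}
    for field, value in record.items():
        # Fields without a tag are treated as "unclassified" and masked
        # unless that class is explicitly permitted (a conservative default).
        privacy_class = privacy_overlay.get(field, "unclassified")
        redacted[field] = value if privacy_class in permitted_classes else mask
    return redacted


record = {"name": "Jane Doe", "diagnosis": "J45", "substanceUse": "none reported"}

# Hypothetical consent: the patient allows clinical data to be shared, nothing else.
shared = redact(record, PRIVACY_OVERLAY, permitted_classes={"clinical"})
print(shared)
```

A wallet application could keep several sets of permitted classes, one per data exchange scenario, and pick the set that matches the patient's input.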